Abstract



We suggest that one (or a collection) of names of 
Yahoo! (or any other WWW indexer's) categories 
can be used to describe the content of a document. 
Such categories offer a standardized and universal 
way for referring to or describing the nature of real 
world objects, activities, documents and so on, and 
may be used (we suggest) to semantically characterize
the content of documents. WWW indices,
like Yahoo!, provide a huge hierarchy of categories
(topics) that touch every aspect of human endeav- 
ors. Such topics can be used as descriptors, sim- 
ilarly to the way librarians use for example, the 
Library of Congress cataloging system to annotate 
and categorize books. 

In the course of investigating this idea, we address 
the problem of automatic categorization of web- 
pages in the Yahoo! directory. We use Telltale as 
our classifier; Telltale uses n-grams to compute the 
similarity between documents. We experiment with 
various types of descriptions for the Yahoo! cate- 
gories and the webpages to be categorized. Our 
findings suggest that the best results occur when 
using the very brief descriptions of the Yahoo!-categorized
entries; these brief descriptions are provided
either by the entries' submitters or by the Yahoo!
human indexers and accompany most Yahoo!-indexed
entries.



1 Introduction 

People are very good at answering the question "what is this 
about?", where "this" might refer to a book, a newspaper ar- 
ticle, a publication, a webpage, etc., especially when "this" 
falls into an area of human knowledge or experience that they 
master. Because the beneficiaries of answers to such ques- 
tions are other people who possess a body of general knowl- 
edge and the mastery of a spoken language, they are not trou- 
bled by the (often) incomplete and non-standardized nature of 
the responses. Computer programs on the other hand could 
benefit from a standardized way for describing the content or 
the nature of "things" (of all things, we will focus on "things" 
of a textual form). The descriptions that we have in mind 
are not semantically deep descriptions of "things" but rather
headline-like accounts of their nature that describe them in 
the broader context of human knowledge and experience. For 
example, Phantom of the Opera might be a Musical, or it 
might be a Musical, which is a form of Theater, which is 
a kind of a Performing Art, which in turn is something 
that has to do with the Arts; in other words, Phantom of 
the Opera is an Arts : Performing Arts : Theater : Musicals
kind of thing.

Librarians have been arduously performing this task for 
centuries but the emergence of the World Wide Web (WWW) 
in recent years has led to the creation of huge indices that 
focus on categorizing selected webpages depending on their 
content. Yahoo!, for example, is an attempt to organize webpages
into a hierarchical index of more than 150,000 categories
(topics). We suggest that a Yahoo! category (or a collection
of them) can be used to describe the content of a document,
the way Arts : Performing Arts : Theater : Musicals,
which is indeed a Yahoo! subcategory, can be
used to refer to Phantom of the Opera or to describe a webpage
about the musical Phantom of the Opera. If the Extensible
Markup Language (XML) lives up to the high expectations
associated with it, one can imagine a tag like YahooCategory
being introduced to supplement the XML source
of a webpage, which in effect would describe how this particular
webpage could have been categorized in the Yahoo!
hierarchy.



Such a semantic annotation of documents would be useful, 
even if it has to be done manually, because it will offer a uniform
and universal way of referring to the content of a document.
Of course, we need not limit ourselves to document descriptions.
Although, for example, agents (human or software
ones) can describe their interests, or their capabilities as col- 
lections of Yahoo! categories, our larger point is that Yahoo! 
categories can be used as a standardized way for referring to 
or describing the "nature" of things. On the other hand, suc- 
cessfully automating this process offers a whole new array 
of possibilities. To name a few, it will be easier to classify 
things into the Yahoo! (or any other WWW indexer's) hier- 
archy, search engines will have an easier task finding things 
if they are semantically annotated this way, spiders will be 
able to index a much larger part of the WWW, browsers can 
be more tuned to their users' particular interests (by tracking 
accessed documents), and so on. 

This paper presents some experiments that explore the au- 
tomation of the process of semantically annotating webpages 
via the use of Yahoo! categories as descriptors of their con- 
tent. So, the question we are addressing is: given some ran- 
dom webpage, if a classifier were to categorize it in the Ya- 
hoo! directory of topics, could it put it at the same place in the 
hierarchical index that the human indexers of Yahoo! would? 
We are less concerned with the choice of classifier and more 
interested in identifying the optimal descriptions for the cat- 
egories and for the webpages to be categorized. Although 
we use an n-gram based classifier called Telltale, we believe 
that another classifier could have been used for our experi- 
ments with possibly better results; our choice was based on 
the immediate availability of the software and the expertise 
of its developers. We first discuss some observations about 
Yahoo! (Section 2) that led to our idea to set up these experi- 
ments. In Section 3 we present the steps of our experiments. 
We continue to present our results (Section 4) and to discuss 
them (Section 5). Before concluding we present our ideas for 
further research in Section 6. 

2 Some observations about Yahoo! 

Yahoo! is an index of categories (topics), organized in a hi- 
erarchical manner. Let us look at the Yahoo! page of a par- 
ticular category. The following is a textual representation of 
what can be found (or, at least could be found at the time 
we collected our data) under http://www.yahoo.com/Arts/Performing_Arts/Theater/Musicals/.
Category names followed by an "@" are links to other Yahoo!
categories, classified under a different path of the Yahoo! hi- 
erarchy (they are like links in the UNIX file-system); so, the 
Yahoo! hierarchy is more like a DAG (directed acyclic graph) 
than a tree. 

Top : Arts : Performing Arts : Theater : Musicals

Options
  Search all of Yahoo
  Search only in Musicals

* Indices (3)
* Movies@
* Shows (124) [new]
* Songwriters@
* Theater Groups (22)
* Australian Musical Theater
* Gilbert and Sullivan@
* Jeff's Musical Page - for Les Miserables,
  Martin Guerre, and other popular musicals.
* Just a Few Miles North of NYC - pictures
  and clips from favorite Broadway shows,
  original scripts, and a chat room to discuss
  theater.
* MIT Musical Theater Guild Archives -
  synopses of musicals
* Musical Cast Album Database - searchable
  database of musicals released on compact
  disc.
* Musical Page - pictures and information
  from popular musicals. The Phantom of the
  Opera, Sunset Boulevard, and several more.
* Musicals Home Page - an index to many
  Broadway musicals.
* Rutger's Theatre Gopher
* Tower Lyrics Archive - lyrics for several
  musicals
* Ultimate Broadway Midi Page - midis from
  a plethora of Broadway shows, as well as
  librettos and synopsises.
* Wisconsin Singers
* Usenet - rec.arts.theatre.musicals

We observe the following items of interest that are present on 
every Yahoo! page describing a Yahoo! category (topic) and 
the chosen entries categorized under this category: 

1. First there is a category name, which in the above example
is:

Top : Arts : Performing Arts : Theater : Musicals 

2. Another group of items is the sub-categories of the cur- 
rent category. 

* Movies@
* Shows (124) [new]
* Songwriters@
* Theater Groups (22)

These sub-categories (the children nodes of the current 
category) come in two varieties: (a) those that point to 
other categories of the Yahoo! hierarchy and are depicted 
with "@" following their name, and (b) those that are in- 
dexed under the current category. So, for the above set of 
sub-categories, only Shows and Theater Groups 
are direct children of Top : Arts : Performing
Arts : Theater : Musicals and they are going to
appear as such in the html document:

<a href="/text/Arts/Performing_Arts/Theater/Musicals/Shows/">
<a href="/text/Arts/Performing_Arts/Theater/Musicals/Theater_Groups/">

The other two categories (Movies@ and Songwriters@),
as their corresponding URLs suggest, point to
other places in the hierarchy:

<a href="/text/Entertainment/Movies_and_Films/Titles/Musicals/Shows">
<a href="/text/Entertainment/Music/Composition/Songwriting/Songwriters/">

3. The most important information is what we can call "se- 
mantic content" of the category, in other words the "con- 
tent" that offers an indirect "description" of the category: 

* Australian Musical Theatre 

. . . other omitted entries .... 

* Ultimate Broadway Midi Page - midis 
from a plethora of Broadway 
shows, as well as librettos 

and synopsises. 
. . . other omitted entries .... 

Every item here is a link outside Yahoo! Each entry is 
presented with a title, e.g., Ultimate Broadway
Midi Page, which could very well be the title field
from the html document of the page, and is (optionally) 
accompanied by a brief description, e.g., midis 
from a plethora of Broadway shows, 
as well as librettos and synopsises, 
which is provided either by the human indexers or 
by the creator of the webpage when (s)he submitted 
it to Yahoo! for indexing. The latter element of the 
categorized entries is what we intend to take advantage 
of. 
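
The distinction drawn in item 2, between sub-categories indexed under the current category and "@" links that point elsewhere in the hierarchy, can be illustrated with a short sketch. It assumes the /text/... link layout shown in the examples above; the function name split_subcategory_links is hypothetical and is not part of the software used in the experiments.

```python
def split_subcategory_links(category_path, hrefs):
    """Separate direct children of a category from "@" cross-links.

    category_path: path of the current category, e.g.
        "/text/Arts/Performing_Arts/Theater/Musicals/"
    hrefs: relative URLs of the sub-category links found on the page.
    A link counts as a direct child when its path is the current path
    plus exactly one more component (an assumption of this sketch).
    """
    base = category_path.rstrip("/") + "/"
    children, cross_links = [], []
    for href in hrefs:
        path = href.strip().rstrip("/") + "/"
        remainder = path[len(base):] if path.startswith(base) else None
        if remainder and remainder.count("/") == 1:
            children.append(href)
        else:
            cross_links.append(href)
    return children, cross_links


links = [
    "/text/Arts/Performing_Arts/Theater/Musicals/Shows/",
    "/text/Arts/Performing_Arts/Theater/Musicals/Theater_Groups/",
    "/text/Entertainment/Music/Composition/Songwriting/Songwriters/",
]
children, cross_links = split_subcategory_links(
    "/text/Arts/Performing_Arts/Theater/Musicals/", links
)
print(children)     # the Shows and Theater_Groups links
print(cross_links)  # the Songwriters link
```

Comparing link paths, rather than looking for the "@" character in the rendered name, keeps the check independent of how a page happens to display the category.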

In Table 1 we summarize various general terms and definitions
used in this document. We consider the Entries already
categorized under a particular Yahoo! category to be the
material for the "description" of the category. Our main thesis
is that these Entries provide us with the semantic content
of a Category, in the sense that if a new Entry were
to be classified under that particular Category, its content
would probably be similar to the content of the Entries already
classified under that particular Category. Our experiments
investigate the best way of describing Categories
and Entries. Categories will be described using combinations
of features (EntryContent, EntryTitle, EntrySummary)
of Entries that have already been classified.
Entries will be described using one of their features
(EntryContent, EntryTitle, EntrySummary). Our
goal is to seek the most promising combination of descriptions
for Category and Entry. The intuition we wished to
explore was that the brief summaries accompanying Yahoo!-indexed
entries offer an information-dense description of entries'
content.

3 An outline of our experiments 

Let us describe the phases of our experiments: 

Phase I We replicated the entire Yahoo! tree locally (approximately
500 MBytes). Some information relating to the number
of Yahoo! Categories and their respective sizes as of
the time of the collection of our data can be found in Table 2
(see footnote 1). By creating a local copy of Yahoo!, we could store
on our systems all the information necessary for our experiments,
without the need for accessing the WWW every time
we needed data. We used Wget (see footnote 2), a GNU network utility for
retrieving files from the WWW, to download and replicate locally
the entire Yahoo! hierarchy.

Phase II We generated the CategoryDescriptions and
the test cases (from here on referred to as TestCases).
In Section 2 we mentioned that there are a number of elements
that we can choose from to construct the CategoryDescription.
For the round of experiments described here,
we decided on three types of CategoryDescription:
EntrySummaries+EntryTitles, EntrySummaries
and EntrySummaries+EntryTitles+Category (see
Table 3). We also had to make similar decisions regarding
the test cases to be used in the experiments (the test cases
were Entries that were already categorized in Yahoo!). We
used three different ways to describe them: EntryTitle,
EntrySummary and EntryContent (see Table 3).

The chosen TestCases were removed, i.e., were not
counted as Entries when constructing the various
CategoryDescriptions. We used some simple heuristics in
order to ensure an even distribution of a sufficient number
of TestCases across the entire collection of Yahoo! Categories
(basically, we took into account the density of each
top-level Category and we tried to allocate approximately
10% of the Entries as TestCases for each top-level Category).
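
The two steps of Phase II can be sketched as follows. This is only an illustration under stated assumptions: entries are represented as dictionaries with hypothetical keys "category", "title" and "summary", the hold-out is a plain random sample of roughly 10% per top-level category (the heuristics actually used are not specified beyond that target), and all function names are invented for the example.

```python
import random
from collections import defaultdict


def top_level(category_name):
    """Return the top-level component of a full category name."""
    parts = category_name.split(" : ")
    if parts[0] == "Top" and len(parts) > 1:
        parts = parts[1:]
    return parts[0]


def hold_out_test_cases(entries, fraction=0.10, seed=0):
    """Sample about `fraction` of the entries of each top-level category."""
    rng = random.Random(seed)
    by_top_level = defaultdict(list)
    for index, entry in enumerate(entries):
        by_top_level[top_level(entry["category"])].append(index)
    held_out = set()
    for indices in by_top_level.values():
        held_out.update(rng.sample(indices, round(len(indices) * fraction)))
    test_cases = [e for i, e in enumerate(entries) if i in held_out]
    remaining = [e for i, e in enumerate(entries) if i not in held_out]
    return test_cases, remaining


def build_category_descriptions(entries, kind):
    """Concatenate entry features into one text per category.

    kind is one of "EntrySummaries", "EntrySummaries+EntryTitles" or
    "EntrySummaries+EntryTitles+Category".
    """
    texts = defaultdict(list)
    for entry in entries:
        if kind != "EntrySummaries":
            texts[entry["category"]].append(entry["title"])
        texts[entry["category"]].append(entry["summary"])
    return {
        category: " ".join(([category] if kind.endswith("+Category") else []) + parts)
        for category, parts in texts.items()
    }


# Toy data: 20 entries under Health and 10 under Education.
entries = [
    {"category": "Top : Health : Fitness", "title": f"title{i}", "summary": f"summary{i}"}
    for i in range(20)
] + [
    {"category": "Top : Education : K-12", "title": f"title{i}", "summary": f"summary{i}"}
    for i in range(20, 30)
]
test_cases, remaining = hold_out_test_cases(entries)
descriptions = build_category_descriptions(remaining, "EntrySummaries")
print(len(test_cases), len(remaining))  # 3 27
```

Building the descriptions only from the remaining entries mirrors the requirement that held-out TestCases do not contribute to any CategoryDescription.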

Phase III We generated the corpus and ran the experiments. 

We used Telltale as our classifier. Telltale [11; 3; 2] was
developed at the Laboratory for Advanced Information
Technology, at the CSEE Department of UMBC;
among other things, Telltale can compute the similarity between
documents, using n-grams as index terms. The weight
of each term is the difference between the count of a given 
n-gram for a document, normalized by its size, and the av- 
erage normalized count over all documents for that n-gram. 
This provides a weight for each n-gram in a document rela- 
tive to the average for the collection (corpus). The similarity 
between documents is then calculated as the cosine of the two 
representation vectors. 
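
The weighting and similarity computation just described can be written down compactly. The following is a minimal sketch and not Telltale's actual code: it assumes character n-grams with n = 3, and the function names are illustrative only.

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Return n-gram counts normalized by the number of n-grams in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def centroid_weights(profiles):
    """Subtract the corpus-average normalized count from each profile."""
    vocabulary = set().union(*profiles)
    centroid = {
        gram: sum(p.get(gram, 0.0) for p in profiles) / len(profiles)
        for gram in vocabulary
    }
    return [
        {gram: p.get(gram, 0.0) - centroid[gram] for gram in vocabulary}
        for p in profiles
    ]


def cosine(u, v):
    """Cosine of two weight vectors defined over the same vocabulary."""
    dot = sum(u[g] * v[g] for g in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


documents = [
    "broadway musicals and theater shows",
    "lyrics from broadway musicals",
    "hospital health care and medicine",
]
weights = centroid_weights([ngram_profile(d) for d in documents])
# The two documents about musicals should score higher with each other
# than the first document does with the one about health.
print(cosine(weights[0], weights[1]) > cosine(weights[0], weights[2]))
```

Because the average is subtracted, n-grams that are common across the whole corpus receive weights near zero, while n-grams that are unusually frequent in a document receive large positive weights.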

Our goal was to generate a single corpus of all Yahoo!
categories and then to run our experiment for each one of
EntrySummaries+EntryTitles, EntrySummaries
and EntrySummaries+EntryTitles+Category and
for every set of TestCases of each type (EntryTitle,
EntrySummary and EntryContent), for a total of 9
experiments (one for each combination of CategoryDescription
and EntryDescription). For each experiment
we expected to compute the similarity of each TestCase
against all CategoryDescriptions (of all Categories)
of a particular type, order them in descending order
(using some cut-off point for the similarity measure) and finally
return the position of the correct match; the correct
match is the Category under which the TestCase was
actually classified in Yahoo! before being removed for the experiments.

1 An interesting observation is the large number of Categories
that appear to be indexed under the Regional top-level Category
(almost 3/4 of all the Categories).

2 http://www.lns.cornell.edu/public/COMP/info/wget/wget_toc.html

Term                 Definition

Category             a particular Yahoo! category (topic)

Entry                a categorized entry (some non-Yahoo! webpage) indexed
                     in a Category

CategoryName         the full hierarchical name of a Category in Yahoo!,
                     e.g., Top : Arts : Performing Arts : Theater : Musicals

CategoryDescription  whatever constitutes the description of the category
                     (see below for elements that can be used in the
                     CategoryDescription of a Category)

EntryContent         the html document that the Entry URL points to; a
                     collection of EntryContent descriptions can be used
                     in a CategoryDescription

EntryTitle           the title of an Entry that is often descriptive of the
                     content of the Entry, e.g., Musicals Home Page; a
                     collection of EntryTitle descriptions can be used in
                     a CategoryDescription

EntrySummary         the brief textual description of an Entry that in the
                     case of Yahoo! is generated by either the Yahoo!
                     classifiers or by the human who submitted the page to
                     Yahoo! for indexing, e.g., an index to many Broadway
                     musicals; a collection of EntrySummary descriptions
                     can be used in a CategoryDescription

Table 1: Summary of terms and definitions used in this document.

Top-level Category        Number of topics    Size (in KB)
                          (subcategories)

Arts                      2553                9417
Business and Economy      13401               91551
Computers and Internet    2357                8549
Education                 322                 1521
Government                3996                27065
Health                    1177                4328
News and Media            1617                6728
Recreation                5200                18032
Reference                 126                 556
Regional                  113952              324180
Science                   2527                899
Social Science            505                 1829
Society and Culture       2797                11255
Total                     151763              518510

Table 2: Summary of top-level Yahoo! Categories and their respective sizes.

CategoryDescription types

EntrySummaries+EntryTitles
                     the collection of the EntrySummaries and EntryTitles
                     for each Entry of a given Category

EntrySummaries       the collection of the EntrySummaries for each Entry
                     of a given Category

EntrySummaries+EntryTitles+Category
                     the combination of EntrySummaries+EntryTitles and the
                     CategoryName, i.e., the collection of the
                     EntrySummaries and EntryTitles for each Entry of a
                     given Category along with the CategoryName of the
                     Category

TestCases types

EntryTitle           we use the EntryTitle of the Entry

EntrySummary         we use the EntrySummaries of the Entries

EntryContent         we use the EntryContent of the Entry; we were careful
                     to only select as TestCases those Entries that pointed
                     to URLs that contained a sufficient amount of text
                     (file size bigger than 1K discounting images,
                     imagemaps, sound files, etc.)

Table 3: Summary of terms and definitions related to the experiments.

4 Experimental Results 

When we started Phase III we realized that Telltale was not 
up to the task of generating the huge corpora we needed for 
the experiments. Merging the corpora of each of the top-level 
Categories into a single Yahoo! corpus proved to be an in- 
surmountable obstacle. Since the new version of Telltale was 
under way we decided to modify our immediate goals and to 
postpone the full version of our experiment until the new and 
improved implementation of Telltale became available. In- 
stead of checking the test cases against the entire collection 
of Categories (a single corpus) we decided to run 3 ex- 
periment sets, with different combinations of top-level Cat- 
egories (so we generated 3 corpora of Yahoo! categories 
instead of one) and TestCases. More specifically, in each
of these experiment sets, the TestCases were drawn from a
different top-level Yahoo! category and matched against Categories
from a single top-level Category (i.e., Health)
or a combination of such (i.e., Education and Social
Sciences), as summarized in Table 4.

For each one of EDUversusEDU+SS, SSversusEDU+SS
and HEALTHversusHEALTH, we ran 9
experiments, one for each combination of CategoryDescription
(EntrySummaries+EntryTitles,
EntrySummaries and EntrySummaries+EntryTitles+Category)
and TestCases type (EntryTitle,
EntrySummary and EntryContent), for a total of 27
experiments. For each of the 27 experiments we returned
2 results: (1) the percentage (and absolute number of test
cases) of times that the correct match appeared first in the
list returned by Telltale, and (2) the percentage (and absolute
number of test cases) of times that the correct match
appeared in one of the first ten positions in the list returned by
Telltale. Table 5 shows the results for all 9 experiments for
EDUversusEDU+SS; likewise for SSversusEDU+SS
and HEALTHversusHEALTH in Table 6 and Table 7, respectively.
Finally, in Table 8 we present the averages across
experiments EDUversusEDU+SS, SSversusEDU+SS
and HEALTHversusHEALTH.
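
The two figures reported for each experiment can be computed from the rank of the correct Category in each returned list. A minimal sketch follows, assuming each test case yields a dictionary of similarity scores keyed by category name; the function names and the sample ranks are illustrative and are not taken from the experiments.

```python
def rank_of_correct(similarities, correct_category):
    """1-based position of the correct category in the ranked list,
    or None if it is absent (e.g. it fell below a similarity cut-off)."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    if correct_category not in ranked:
        return None
    return ranked.index(correct_category) + 1


def summarize(ranks, k=10):
    """Return (count, percentage) for rank 1 and for ranks 1..k."""
    total = len(ranks)
    top_1 = sum(1 for r in ranks if r == 1)
    top_k = sum(1 for r in ranks if r is not None and r <= k)
    return {
        "top_1": (top_1, 100.0 * top_1 / total),
        "top_k": (top_k, 100.0 * top_k / total),
    }


print(rank_of_correct({"A": 0.2, "B": 0.9, "C": 0.5}, "C"))  # 2
print(summarize([1, 3, None, 12, 1]))
# {'top_1': (2, 40.0), 'top_k': (3, 60.0)}
```

Test cases whose correct Category never appears in the list count against both percentages, which matches reporting the figures relative to all test cases.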

After evaluating the results we can draw the following
conclusions: (1) the most successful combination of CategoryDescription
and Entry description is EntrySummaries+EntryTitles
with EntrySummary, i.e.,
choosing the collection of the EntrySummaries and EntryTitles
for each Entry of a given Category to describe
the Category and choosing the EntrySummary of
the Entry to describe the Entry; (2) EntrySummaries+EntryTitles
outperforms all the other CategoryDescriptions,
regardless of the choice of entry description;
and (3) EntrySummary outperforms all the other entry descriptions,
regardless of the choice of CategoryDescription.

The results seem to corroborate one of our motivating
intuitions for these experiments, i.e., that the brief summaries
offer a very dense description of entries' contents.

5 Discussion 

The purpose of this experiment was to automatically cate- 
gorize web-documents in the Yahoo! hierarchy. Researchers 
in the areas of Machine Learning and Information Retrieval 
have experimented with categorization into hierarchical in- 
dices. But our experiments are not comparable with the ones 
described in [6] and [14], for example, because of the differ- 
ence in the order of magnitude of the number of categories 
(less than 20 in [6], more than 1000 in our case) that we are 
attempting to match against. A fair evaluation of the results 
has to take into account the sheer number of categories being
considered when a webpage is evaluated for classification.

                      TestCases               Category

EDUversusEDU+SS       from Education          Education and Social Sciences
SSversusEDU+SS        from Social Sciences    Education and Social Sciences
HEALTHversusHEALTH    from Health             Health

Table 4: The three experiments we conducted

EDUversusEDU+SS
                                     EntryTitle            EntrySummary          EntryContent
CategoryDescription                  1          1-10       1          1-10       1          1-10
EntrySummaries+EntryTitles           22 (37%)   38 (64%)   35 (63%)   45 (82%)   13 (50%)   18 (69%)
EntrySummaries                       16 (27%)   30 (51%)   16 (29%)   34 (62%)   9 (35%)    14 (54%)
EntrySummaries+EntryTitles+Category  20 (34%)   38 (64%)   23 (42%)   34 (62%)   8 (29%)    15 (54%)

Table 5: Results from EDUversusEDU+SS; the corpus is composed of the top-level Categories of Education
and Social Sciences and the TestCases are drawn from Education. We provide the absolute numbers and the
percentages of the TestCases that were returned in the top position (1) of the list returned and in the (1-10) range.

SSversusEDU+SS
                                     EntryTitle            EntrySummary          EntryContent
CategoryDescription                  1          1-10       1          1-10       1          1-10
EntrySummaries+EntryTitles           11 (20%)   25 (46%)   37 (82%)   44 (98%)   8 (40%)    17 (85%)
EntrySummaries                       7 (13%)    20 (37%)   10 (22%)   25 (56%)   4 (20%)    12 (60%)
EntrySummaries+EntryTitles+Category  16 (30%)   24 (44%)   23 (51%)   36 (80%)   7 (35%)    13 (65%)

Table 6: Results from SSversusEDU+SS; the corpus is composed of the top-level Categories of Education and
Social Sciences and the TestCases are drawn from Social Sciences. We provide the absolute numbers and the
percentages of the TestCases that were returned in the top position (1) of the list returned and in the (1-10) range.

HEALTHversusHEALTH
                                     EntryTitle            EntrySummary          EntryContent
CategoryDescription                  1          1-10       1          1-10       1          1-10
EntrySummaries+EntryTitles           46 (37%)   75 (60%)   90 (75%)   114 (95%)  30 (43%)   55 (80%)
EntrySummaries                       32 (26%)   71 (57%)   30 (30%)   78 (65%)   21 (30%)   41 (59%)
EntrySummaries+EntryTitles+Category  56 (45%)   81 (65%)   60 (50%)   88 (73%)   19 (29%)   46 (70%)

Table 7: Results from HEALTHversusHEALTH; the corpus is composed of the top-level Category of Health and
the TestCases are drawn from Health. We provide the absolute numbers and the percentages of the TestCases that were
returned in the top position (1) of the list returned and in the (1-10) range.

All Experiments
                                     EntryTitle            EntrySummary          EntryContent
CategoryDescription                  1          1-10       1          1-10       1          1-10
EntrySummaries+EntryTitles           31%        57%        73%        92%        44%        78%
EntrySummaries                       22%        48%        27%        61%        28%        58%
EntrySummaries+EntryTitles+Category  36%        58%        48%        72%        31%        63%

Table 8: Averages of the percentages from the results from EDUversusEDU+SS, SSversusEDU+SS and
HEALTHversusHEALTH.

The only similar work we are aware of (see footnote 3) is the Yahoo
Planet project [8; 10], which uses the Yahoo! hierarchy of Web
documents as a base for automatic document categorization.
Several top categories are taken as separate problems, and
for each an automatic document classifier is generated. A
demo version of the system (see footnote 4) enables automatic categorization
of typed text inside the sub-hierarchy of a selected top
Yahoo! category. Users can categorize whole documents by
simply copying their content into a window and requesting
categorization of the "typed" text. Their methodology differs
in that they built a classifier for each category which
learns from positive (correctly indexed webpages) and negative
examples; unlike our method, they do not make use of
the brief summaries of the categorized entries. This work relies
on Machine Learning techniques and is part of a much
larger endeavor [9]. In terms of comparing the results, one
should keep in mind two basic differences: (a) a top-level Yahoo!
category has to be pre-selected (in experiments EDUversusEDU+SS
and SSversusEDU+SS we use a combination
of two top-level categories), and (b) their metric is
slightly different from ours, i.e., they present the median of the
rank of the correct category, e.g., a result of "median of rank of correct
category" equal to 3 means that half of the testing examples
are assigned a rank of 1, 2 or 3 [5]. In their experiments the
medians for the top-level categories of Reference, Education
and Computers and Internet are 2, 3 and 3,
respectively. By comparison, the results of Table 5, where the
test cases are drawn from Education and matched against
the combined top-level categories of Education and Social
Sciences, suggest a median of 1 (since 50% of the
test cases have a rank of 1) for the case of EntryContent
(which is equivalent to their "description" of the test case).
But again, a one-to-one comparison is impossible. We only
consider test cases that have enough text in them, and although
they also employ similar criteria to make sure that a webpage
has enough text to work with, any comparison will be incomplete
and inaccurate unless we attempt to categorize exactly
the same set of test cases.

3 We were not aware of this work at the time we conceived and
ran our experiments.

4 http://ml.ijs.si/yquint/yquint.exe; it does not
seem to be running anymore.
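
The "median of rank of correct category" metric, and the inference that 50% of test cases at rank 1 implies a median of 1, can be checked with a few lines. This is a sketch under the assumption that the median is taken as the lower middle value, which matches the reading "half of the testing examples are assigned a rank of 1, 2 or 3"; the rank lists below are made up for illustration and are not experimental data.

```python
from statistics import median_low


def median_rank(ranks):
    """Smallest rank r such that at least half of the test cases
    have a rank of r or better (the lower median)."""
    return median_low(ranks)


# 13 of 26 hypothetical test cases ranked first, the rest lower.
ranks = [1] * 13 + [2] * 5 + [5] * 8
print(median_rank(ranks))                  # 1
print(median_rank([1, 2, 3, 7, 9, 20]))    # 3
```

With an even number of test cases the ordinary median would average the two middle ranks, so the choice of convention matters exactly in the 50% case discussed above.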

If a webpage (or document) were to be classified automatically,
one would expect 100% accuracy by the classifier. In
that sense, ours is a failed experiment. With respect to a fully 
automatic categorization of webpages, our approach presents 
an additional shortcoming: the best performance occurs when 
some brief textual description of the webpage is used, as is 
the case with most of the webpages categorized in Yahoo!. If 
webpages are to be categorized without human intervention, 
no such brief description is expected to be provided. It is 
quite surprising though, how encouraging the results are even 
when just a few words are available. On the other hand, it 
seems that the collection of EntrySummaries and EntryTitles
(EntrySummaries+EntryTitles) is extremely
powerful in terms of describing the content of a particular
Category. An observation in favor of our results is that
we take the Yahoo! indexing to be the absolute and only cor- 
rect categorization of a document. In other words, we do 
not investigate whether the matches returned by our classifier 
are reasonable or correct matches, even if the Yahoo! index- 
ers thought otherwise (perhaps because of the additional time 
needed to classify a webpage in multiple locations in the hierarchy).
Finally, we discount as false a result that returns a
Category that, even though it is not the correct one, is pretty
close (semantically) to it.

Maybe a proper evaluation of the results depends on the 
potential use of this technique. Inadequate as it might be for 
a strictly automated categorization of webpages, it could be 
useful for offering suggestions to a human indexer. If, though,
the owner of the webpage is willing to provide a very brief
account of the webpage, our method could be useful for automatic
categorization. Finally, if the method is used for automatically
tagging webpages (or documents) in order to semantically
describe their content, the error might be within
an acceptable range for the purpose.

6 Future work 

Our next goal is to experiment with the new version of Telltale,
which will allow us to test the TestCases against a corpus
of all the Yahoo! topics minus the Regional category (a
total of approximately 38,000 categories). One of our observations
about Yahoo! is that 3/4 of its topics are indexed under
the Regional top-level category. It seems that most of
the topics indexed somewhere in a Regional sub-category
could have also been indexed under another top-level category
but they do not appear there too. For example, imagine
some small-town real estate agency which is indexed under
the real-estate businesses of the small town's Category under
Regional but not under real-estate businesses, under
the top-level Business and Economy category. Our experiments
so far dealt with only 1000 topics, so we do not
know what to expect after a one to two orders of magnitude
increase. Intuitively, we expect that our current results constitute
a best-case upper bound for future results.

Another direction for future experimentation would be to 
experiment with other classifiers. We used n-grams and 
Telltale because the system was readily available to us and 
we had immediately available expertise on how to use it 
for our purposes. We want to experiment with a term 
frequency/inverse document frequency (TF/IDF) weighting 
scheme for Telltale; [7] suggests that TF/IDF outperforms the 
centroid weighting method that Telltale currently employs. 
It would also be worth investigating classifiers that take into 
consideration the hierarchical structure of the Yahoo! topics, 
a feature that we did not explore in our experiments. We would
like to improve the performance of the EntryContent type 
of an entry's description. This would be crucial if we were 
to use the technique for automatic categorization, since in 
this case we can only rely on the html content of the web- 
document (or the text of a document, in general). So far, our 
approach with the html content was very basic. Other than 
making sure that there was enough textual content present, 
we did not further manipulate its content. 
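
For readers unfamiliar with the TF/IDF scheme mentioned above, the following sketch shows one common variant (term frequency normalized by document length, multiplied by the logarithm of the inverse document frequency). It is an assumption-laden illustration: it is not necessarily the formulation evaluated in [7], nor anything implemented in Telltale, and the function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """documents: list of lists of terms (e.g. n-grams).

    Returns one {term: weight} dict per document, using
    tf = count / document length and idf = log(N / df).
    """
    n_docs = len(documents)
    document_frequency = Counter()
    for terms in documents:
        document_frequency.update(set(terms))
    weights = []
    for terms in documents:
        counts = Counter(terms)
        weights.append({
            term: (count / len(terms)) * math.log(n_docs / document_frequency[term])
            for term, count in counts.items()
        })
    return weights


docs = [["the", "musical", "theater"], ["the", "health", "care"]]
w = tfidf_weights(docs)
print(w[0]["the"])          # 0.0, because "the" occurs in every document
print(w[0]["musical"] > 0)  # True
```

Unlike the centroid scheme, which subtracts an average count, this variant scales each term by how rare it is across the corpus, so a term present in every document contributes nothing.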

Finally, we would like to reconsider the evaluation of the
matches returned by the classifier. Some of the top matches 
might not be the "perfect" match, i.e., the official Yahoo! cate- 
gorization of a given webpage but they might be close enough 
to the perfect match in the huge Yahoo! DAG, to be useful 
for providing some sort of semantic information about the 
content of the webpage (less accurate but still useful). Also, 
besides considering such "approximate" matches, it would 
be interesting to have people evaluate the results returned 
by the classifier. Just because a webpage was classified by 
the Yahoo! human indexers in a particular category, this does 
not mean that other possible correct categories do not exist, 
some of which might have been returned by our classifier. So, 
we would like to have human indexers evaluate the accuracy 
of the returned matches without knowledge of which match 
might have been the Yahoo! one. We want to re-evaluate the 
performance of our method under such revised metrics. 
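
One possible way to quantify how "close" a returned match is to the official one is the number of steps between the two categories in the hierarchy. The sketch below is purely hypothetical: it treats the hierarchy as a tree by comparing full category names and ignores the "@" cross-links that make Yahoo! a DAG; the function name is invented for the example.

```python
def category_distance(name_a, name_b, separator=" : "):
    """Number of edges between two categories in the name tree:
    steps up from name_a to the deepest shared ancestor, plus
    steps down from there to name_b."""
    a = name_a.split(separator)
    b = name_b.split(separator)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


correct = "Top : Arts : Performing Arts : Theater : Musicals"
print(category_distance(correct, correct))                                   # 0
print(category_distance(correct, correct + " : Shows"))                      # 1
print(category_distance(correct, "Top : Arts : Performing Arts : Dance"))    # 3
```

A revised metric could then accept a match whose distance falls under some threshold, although choosing that threshold would itself require the kind of human evaluation proposed above.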

7 In conclusion 

In this paper we presented a claim and a set of experiments.
The claim was that one could use the pre-defined categories
of one of many WWW indexers to describe the nature or the
content of "things". Although of all "things" we focused on
documents, we believe that such categories can be used to describe
a large range of activities, objects, etc. Our experiments
and their success are independent of the claim,
which by itself we did not validate, feeling that the usefulness
of such a standardized way of referring to or describing
"things" is rather obvious for computer applications. Our experiments
investigated the automation of the process of finding
the correct description, i.e., a WWW indexer's category
(specifically a Yahoo! category), to describe a particular kind
of "thing", i.e., a webpage. In principle, little would change
if instead of a webpage we had chosen a document that focuses
on some particular topic. Our results indicated that the
specific method we used (using a classifier called Telltale)
cannot be used alone to automatically categorize documents
if the actual text of the document is used for the classification.
One of our main observations, though, was that a very
brief description of the document dramatically improves the
effectiveness of the classification. So, given our working assumption
that automatic classification would require almost
100% accuracy, we believe that the best use of our method
would be in conjunction with a human to whom our classifier
would offer recommendations. One other important result
was that the collection of the brief summaries that accompany
the indexed (under a particular category) webpages in Yahoo!
is extremely useful in capturing what a category is about.
This result might be of interest to other researchers interested
in similar problems.